Fastcdc #11
Conversation
This is on top of #10. Still a WIP - I need to confirm that it actually works as expected, etc.
@RAOF You've mentioned that you're planning to work on FastCDC. Currently, I'm not even sure if I correctly understood the whitepaper, but here it is, in case you're interested.
For reasons unknown, this seed generates weird results.
Another implementation - https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs
Oh, nice! There was so much stuff that I was unclear about, and I think here it actually might be right. It doesn't look to me like it is going to be fast, though. The API enforces copying, and nested conditions inside a loop (https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs#L103) are probably not going to get optimized away. As far as I understand, the biggest chunk of the speedup was being able to skip rolling through the initial bytes. Normalized chunking was supposed to counteract the deduplication-ratio loss without affecting the speedup. At the very minimum, I should write a test that compares the two implementations, and then beat the code into submission until both versions yield the same results. :)

@dswd I need more time to review, but in the meantime: https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs#L24 - what is the deal with these masks? In the Ddelta paper, Gear was described as just using the most significant bits, so why can't that be used in FastCDC? What do you think about statically including the lookup table? It's just 2K.
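For reference, this is a rough sketch of the overall FastCDC structure as I read the paper (illustrative only, not this PR's code; the mask constants are placeholders, and the `gear` table is assumed to be supplied by the caller):

```rust
/// Find a cut point in `data`: a sketch of FastCDC as described in the
/// paper, with placeholder constants (8 KiB average, 13 +/- 2 mask bits).
fn fastcdc_cut(data: &[u8], gear: &[u64; 256]) -> usize {
    const MIN_SIZE: usize = 2 * 1024;
    const AVG_SIZE: usize = 8 * 1024;
    const MAX_SIZE: usize = 64 * 1024;
    // Normalized chunking: stricter mask (more one-bits) before the
    // average size, looser mask after it.
    const MASK_HARD: u64 = 0x1111_1111_1111_1110; // 15 one-bits (placeholder)
    const MASK_EASY: u64 = 0x1111_1111_1110_0000; // 11 one-bits (placeholder)

    let len = data.len().min(MAX_SIZE);
    if len <= MIN_SIZE {
        return len;
    }
    // The main speedup: bytes before MIN_SIZE can never be a cut point,
    // so the paper does not even roll the hash over them.
    let mut digest: u64 = 0;
    let normal = len.min(AVG_SIZE);
    let mut i = MIN_SIZE;
    while i < normal {
        digest = (digest << 1).wrapping_add(gear[data[i] as usize]);
        if digest & MASK_HARD == 0 {
            return i + 1;
        }
        i += 1;
    }
    while i < len {
        digest = (digest << 1).wrapping_add(gear[data[i] as usize]);
        if digest & MASK_EASY == 0 {
            return i + 1;
        }
        i += 1;
    }
    len
}
```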
Hi, I reviewed my code and implemented some improvements.

On the masks: the paper describes the masks for an 8k average size and mentions that spreading the 1 bits out in the masks yields slightly better deduplication. So that is what I am doing in that method: I am using a simple PRNG, seeded with the given seed, to set the correct number of random bits in the masks. The major thing here is that the masks depend on the seed; otherwise you could just use a static one.

On the static lookup table: I can't do this, as my table is seeded, but I don't see a problem with a static table if you do not need this feature. However, the computing effort of calculating the table is negligible, and I doubt it will yield any improvement in accessing it during chunking.

On the performance: on my laptop, my optimized code reaches over 1000 MB/s, while the code of this PR only reaches below 500 MB/s (my previous unoptimized code had the same speed). I don't know exactly why, but I suspect that my dummy sink allows the compiler to omit the costly memcpy calls. Surely this is not the most elegant way to get the positions, but it seems fast enough. I have no clue why your code is so much slower; it should be at least as fast as mine.
Very interesting! We should definitely investigate how far we can push our two versions. :) I glanced quickly at https://github.com/dswd/zvault/blob/master/chunking/benches/all.rs#L107, and I am suspicious that the compiler may be optimizing some of the work away. Similarly in my benchmarks there is such a risk, but since mine is not that much faster, I haven't thought about it. At the bare minimum we should put the results into https://doc.rust-lang.org/1.1.0/test/fn.black_box.html to prevent (somewhat, hopefully) unfair optimizations.
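A black_box-guarded benchmark could look like this (a sketch using the unstable `test` crate on nightly; `find_chunk_edge` is a stand-in chunker for the example, not the crate's real API):

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Stand-in edge finder so the example compiles; any real chunker goes here.
fn find_chunk_edge(buf: &[u8]) -> Option<usize> {
    buf.iter().position(|&b| b == 0xff)
}

#[bench]
fn bench_chunking(b: &mut Bencher) {
    let data = vec![0u8; 4 * 1024 * 1024];
    b.bytes = data.len() as u64;
    b.iter(|| {
        // black_box the input so the compiler can't constant-fold it, and
        // black_box the result so the search can't be dead-code eliminated.
        black_box(find_chunk_edge(black_box(&data)))
    });
}
```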
Some nits, and a suggestion for later refactoring.
```rust
if self.current_chunk_size < avg_size {
    let roll_bytes = cmp::min(avg_size - self.current_chunk_size,
                              buf.len() as u64);
    let result = self.gear.find_chunk_edge_cond(buf, |e: &Gear| (e.digest() >> min_shift) == 0);
```
This is missing the padding-zeros optimisation for deduplication efficiency (likewise the large-chunk calculation)
I don't get the padding zeros. Since the underlying Gear (as described in the Ddelta paper) is taking the most significant bits, isn't that the ultimate padding? I was so confused by this.
Hm, you're quite right. I've misread this; the window here is essentially the width of the digest: 64 bytes.

The authors of the paper describe the mask they use in the algorithm as being empirically derived, but infuriatingly give no details about it. You'd think that taking the largest window would be best, but apparently not? Apparently it works best when the contributing bits are split approximately uniformly across the 64-bit digest?
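To make the two options concrete, a small illustration (the spread mask below is invented for the example; the paper's actual empirical constants are never derived):

```rust
fn main() {
    const BITS: u32 = 13; // 2^-13 boundary probability, 8 KiB average
    // Spread-out mask with the same number of one-bits; a made-up value,
    // not one of the paper's empirical constants.
    const MASK_SPREAD: u64 = 0x1111_1111_1111_1000;
    assert_eq!(MASK_SPREAD.count_ones(), BITS);

    let digest: u64 = 0xdead_beef_cafe_f00d;
    // Gear/Ddelta style: test only the top BITS bits, as this PR does in
    // the find_chunk_edge_cond closure above.
    let msb_boundary = (digest >> (64 - BITS)) == 0;
    // FastCDC style: test BITS bits spread across the whole digest.
    let spread_boundary = digest & MASK_SPREAD == 0;
    println!("msb: {}, spread: {}", msb_boundary, spread_boundary);
}
```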
> it works best when the contributing bits are split approximately uniformly across the 64-bit digest?

But why? :D I am not a very academically minded person, but I found this paper rather confusing in many places. A lot of repeating the obvious, and glossing over the important details.
Well, @dswd has the correct implementation here (https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs), so we can just use it. :)
```rust
let mut cur_offset = 0usize;

loop {
```
What is this loop for? As far as I can tell it will loop exactly once, because all codepaths return?
```rust
impl Engine for FastCDC {
    type Digest = u64;
```
This really emphasises that trait `Engine` is an incorrect abstraction for CDC. `FastCDC` doesn't actually have a `Digest`, nor can you roll a single byte. (Similarly, AE and MAXP can't even pretend to have a digest, because they're not even approximately hash-based.)
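A hypothetical sketch of what a digest-free chunking abstraction could look like (not from either codebase, just to make the point concrete):

```rust
/// A chunker only needs to find boundaries; it need not expose a digest,
/// so hash-less algorithms like AE or MAXP fit too.
trait Chunker {
    /// Offset of the first chunk boundary within `buf`, or `None` if the
    /// current chunk continues past the end of the buffer.
    fn find_boundary(&mut self, buf: &[u8]) -> Option<usize>;
}
```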
True. However, `Engine` is all we have ATM. :)
Interesting point, I had only been looking at rolling checksums (hence the crate name) and hadn't really thought about non-checksum based alternatives for doing chunking.
That's an excellent question! This appears to be a recurring problem in the CDC papers I've read. It would have been really nice to see the derivation of their masks - what did they try? What was the range of deduplication efficiencies? How stable is their choice - is there a dataset on which it is really good and others on which it's mediocre, or is it reliably good for all datasets they tried? 🤷‍♀️
My new code is using black_box, which does not seem to change the performance. Also, I changed my dummy sink to calculate the split positions, to be comparable to your code. Speaking about the curious masks, I am also very unsure why they want to distribute the bits uniformly. It seems they do not really want to have a 64-byte window.
```rust
// ignore bytes that are not going to influence the digest
if self.current_chunk_size < ignore_size {
    let skip_bytes = cmp::min(ignore_size - self.current_chunk_size, buf.len() as u64);
```
Doing these `min`s outside of the condition might perform better. Like here: https://github.com/dswd/zvault/blob/master/chunking/src/fastcdc.rs#L99
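Something like this, hoisting the clamp out of the branch (a sketch against the names in the snippet above, not the actual PR code):

```rust
use std::cmp;

/// With `saturating_sub`, the result is 0 whenever `current_chunk_size` is
/// already past `ignore_size`, so the separate `<` check disappears.
fn bytes_to_skip(current_chunk_size: u64, ignore_size: u64, buf_len: usize) -> u64 {
    cmp::min(ignore_size.saturating_sub(current_chunk_size), buf_len as u64)
}
```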
@dswd I will investigate your code and benchmarks more at some point, but as it is right now I actually like your code somewhat better anyway. :D

The one thing that I don't like are these ...

Also, in my implementation, I do roll through the 64 bytes before ...

In my version I use 8 for both ...

OK. Will get back to it at some point. :)
The ...

I thought about including data before ...

In my experiments, skipping the ...
Just a minor note - please don't copy code directly from @dswd, as our licenses are not compatible (assuming the chunking code has the same license as zvault itself). :)
The chunking code, especially the FastCDC part, is not that much code, so I don't object to MIT/BSD licenses. I hereby grant permission to use my chunking code (https://github.com/dswd/zvault/tree/master/chunking) under the terms of either the MIT license or the BSD license. If you plan a combined crate with lots of chunking algorithms, I am also happy to move my code there and use that crate as a dependency, as long as it provides an efficient interface for my use case.
The above are inner loops from ...

Mine:

...

@dswd's:

...

In my code the compiler fails to use a register ... Also there is an additional ...
The good part is that 2100 MB/s seems doable (I was afraid it was just a misunderstanding of some kind), so it will push the bottleneck that I often see in rdedup quite far away.
Since I've failed to find a way to get the compiler to optimize optimally, I've filed rust-lang/rust#43653.
The fix is relatively simple, at least for ...

Before: ...

After: ...

Comparing to ...
The difference (I'm guessing) is because FastCDC is generally faster, since it's skipping some data, which Gear is not doing. I'll redo ...
Reminder to self: there is one ...
I have reworked and optimized things a bit more in #14, reaching a speed almost as good as @dswd's code: ...
However, I've filed dswd/zvault#10 to show the cost of an API that involves temporary copying.
Closing this one, we can continue in #14.